I. Introduction
The principle of forecasting is that future values can be predicted before the real event occurs
by identifying underlying patterns in a dataset. Forecasting is made possible by analyzing the
statistical properties of datasets that contain one or more variables.


As research into forecasting has progressed over time, models such as neural networks and
piecewise linear models have been published online. These models allow individuals to test and
automate forecasting on their own datasets.


For this research document, the SARIMA and PROPHET models will be investigated and
compared in both performance and accuracy. Both models are popular choices among
individuals and businesses for forecasting stocks and climate change. The SARIMA model
combines several linear components whose weights are estimated from the data and applied to
each forecast. The PROPHET model likewise connects simpler models together to produce a
forecast.


The dataset of interest for this investigation is the number of new COVID-19 cases reported
daily in South Korea from January 3rd, 2020, to May 3rd, 2023. Before the COVID-19 dataset
was chosen, climate change data was considered as an alternative because it covers a much
longer period. With the goal of acknowledging the pandemic and its direct impact on humanity,
climate change was set aside, as it had less urgency and attention compared to COVID-19.


To summarize, the research will answer the question, “What is the relative forecasting accuracy
of SARIMA and PROPHET models for daily COVID-19 cases in South Korea?” Throughout this
research, an analysis of how both models work will be completed. Alongside this, a method of
applying the two models is needed in order to produce forecasts and test their relative accuracy.


II. Literature Review 

2.1.1 Non-seasonal ARIMA Model 

The ARIMA model is made up of three components: auto-regressive (AR), integrated (I), and
moving-average (MA). Each component is controlled by an order parameter: p, d, and q
respectively. The parameters can be changed to fit the model, and candidate models can be
compared using Akaike’s Information Criterion (AIC), its small-sample corrected form (AICc),
and the Bayesian Information Criterion (BIC) (Smith 6).


AR is assigned to p, I is assigned to d, and MA is assigned to q. The p value indicates the
number of lag observations, that is, how many past values of the variable are used to predict
its current value (Hyndman and Athanasopoulos 8.2, 8.3).
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t$$
Equation 1: Mathematical expression for autoregressive models (Hyndman and Athanasopoulos 8.3)
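
As a rough illustration of Equation 1, the sketch below computes a one-step autoregressive prediction from the most recent p observations. The function name ar_predict and the numbers are hypothetical and chosen only for illustration; in practice the weights would be estimated by a fitted model.

```python
def ar_predict(history, c, phi):
    """One-step AR(p) prediction: c + phi_1*y_{t-1} + ... + phi_p*y_{t-p}.

    The error term is omitted because its expected value is zero.
    """
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    recent = history[::-1][:p]  # y_{t-1}, y_{t-2}, ..., y_{t-p}
    return c + sum(weight * y for weight, y in zip(phi, recent))


# Hypothetical AR(2) example: c = 1.0, phi_1 = 0.5, phi_2 = 0.25
print(ar_predict([10.0, 12.0, 11.0], c=1.0, phi=[0.5, 0.25]))  # 9.5
```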


The d value identifies the degree of differencing, that is, the number of times the previous
value is subtracted from the current value in order to make the series stationary
(Hyndman and Athanasopoulos 8.1).
$$y'_t = y_t - y_{t-1}$$
$$y''_t = y'_t - y'_{t-1} = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) = y_t - 2y_{t-1} + y_{t-2}$$
Equation 2: Mathematical expression for differencing (Hyndman and Athanasopoulos 8.1)
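
The differencing in Equation 2 can be sketched in a few lines of plain Python. The helper name difference and the sample series are assumptions made for illustration and are not taken from the project code.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y'_t = y_t - y_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


series = [3, 5, 9, 15, 23]
print(difference(series, d=1))  # [2, 4, 6, 8]
print(difference(series, d=2))  # [2, 2, 2]
```

Each round of differencing shortens the series by one value, which is why d is usually kept small.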
The q value identifies the order of the moving-average component, which uses past forecast
errors, each multiplied by a weight, to adjust the forecast accordingly (Hyndman and
Athanasopoulos 8.4).


$$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$$
Equation 3: Mathematical expression for the moving average model (Hyndman and Athanasopoulos 8.4)


Hence, the formula for the non-seasonal ARIMA model can be derived by combining these three 


components into one equation. 
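$$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t$$
Equation 4: Mathematical expression for the ARIMA model, where y'_t is the differenced series (Hyndman and Athanasopoulos 8.5)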


As previously mentioned, the AIC acts as an estimator of prediction error. It can therefore be
used to compare candidate models during model selection (Burnham and Anderson). However,
the AIC can select a model that overfits or underfits, which would lead to misleading results,
particularly when the sample is small. To reduce this risk in small samples, the AICc can be
used; it adds a correction term that depends on the sample size, at the cost of a slightly more
complex formula (Burnham and Anderson). The BIC serves a similar purpose to the AIC: it
works as an estimator, and the lower its value, the better the model is judged to be
(Burnham and Anderson).
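
To make the three criteria concrete, the sketch below uses their standard textbook definitions, where k is the number of estimated parameters, n is the number of observations, and log_likelihood is the maximized log-likelihood of the fitted model. The function names and the example values are hypothetical; a forecasting library would normally report these numbers directly.

```python
import math


def aic(log_likelihood, k):
    """Akaike's Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction term 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fitted model: log-likelihood -100, 5 parameters, 50 observations
print(aic(-100.0, 5))
print(aicc(-100.0, 5, 50))
print(bic(-100.0, 5, 50))
```

For all three criteria, the candidate model with the lowest value is preferred. The correction term in the AICc shrinks towards zero as n grows, so the AIC and AICc agree on large samples.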


2.1.2 Seasonal ARIMA Model 
The SARIMA (Seasonal ARIMA) model is an extension of the ARIMA model that also captures the
seasonality found throughout the dataset. Its additional parameters are P, D, Q, and m, where
m is the number of observations per seasonal cycle (Hyndman and Athanasopoulos 8.9).


The P value is the order of the seasonal autoregressive component, that is, how many values
from previous seasons (at lags m, 2m, and so on) are used to predict the current value.


The D value is the degree of seasonal differencing, which shows how many times the data is
differenced at a seasonal lag to make it stationary.


$$y'_t = y_t - y_{t-m}$$
$$y''_t = y'_t - y'_{t-m} = (y_t - y_{t-m}) - (y_{t-m} - y_{t-2m}) = y_t - 2y_{t-m} + y_{t-2m}$$
Equation 5: Mathematical expression for seasonal differencing (Hyndman and Athanasopoulos 8.1)
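
Seasonal differencing as in Equation 5 can be sketched the same way as ordinary differencing, except that each value is compared with the value one season (m steps) earlier. The helper name seasonal_difference and the toy series with m = 3 are assumptions for illustration only.

```python
def seasonal_difference(series, m, D=1):
    """Apply seasonal differencing D times: y'_t = y_t - y_{t-m}."""
    out = list(series)
    for _ in range(D):
        out = [out[i] - out[i - m] for i in range(m, len(out))]
    return out


# Toy series that repeats every 3 steps while rising by 2 per cycle
series = [10, 20, 30, 12, 22, 32, 14, 24, 34]
print(seasonal_difference(series, m=3, D=1))  # [2, 2, 2, 2, 2, 2]
print(seasonal_difference(series, m=3, D=2))  # [0, 0, 0]
```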


The Q value is the order of the seasonal moving-average component, which shows how many
forecast errors from previous seasons are used, with weights, to predict the current value
(Hyndman and Athanasopoulos 8.9).


To determine which seasonal parameters are most suitable for the SARIMA model, it is helpful to
inspect the seasonal lags of both the autocorrelation function (ACF) and the partial
autocorrelation function (PACF). The ACF measures how strongly the time series correlates with
itself at different lags. For example, if the ACF at lag 12 is high, then the value of the time
series at a given time point tends to be similar to the value 12 lags earlier. The PACF measures
how strongly the time series correlates with itself at different lags after removing the effect
of the shorter lags. For example, if the PACF at lag 12 is high, the value of the time series at
a given time point is similar to the value 12 lags earlier even after accounting for the values
in between. When the ACF shows a gradual decay and the PACF shows a sharp cut-off after a
certain lag, it suggests an AR component at that lag. If the ACF shows a sharp cut-off and the
PACF shows a gradual decay, it suggests an MA component at that lag. If either plot shows a
significant spike at the seasonal lag, it suggests a seasonal component at that lag (Hyndman and
Athanasopoulos 8.9).
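
A minimal sketch of the sample ACF is given below, using the usual estimator that divides the lagged covariance by the overall variance. The function name acf and the toy series with a period of 4 are hypothetical. The PACF is not shown because it requires fitting a regression at each lag.

```python
def acf(series, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Toy series that repeats every 4 observations
series = [1, 2, 3, 4] * 6
for lag, value in enumerate(acf(series, max_lag=4)):
    print(lag, round(value, 3))
```

For this series the autocorrelation at lag 4 is much higher than at lag 2, which is the kind of spike at the seasonal lag that the paragraph above describes.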


2.2 PROPHET Model
The PROPHET model is made up of three main components, trend (g(t)), seasonality (s(t)), and
holidays (h(t)), plus an error term. These components are added together to generate the
formula used for forecasting.


The trend component models the non-periodic changes in the value of the time series
(Taylor and Letham 39). The component can be either linear or logistic depending on the growth
parameter. A linear trend is a straight line with a constant slope; a point where the slope
changes is referred to as a changepoint. A logistic trend is a curve that eventually approaches
a limit, called the carrying capacity, which is the maximum value the forecast can reach. The
mathematical formula for this component can be simplified into:


$$g(t) = \begin{cases} kt + m & \text{if growth = 'linear'} \\ \dfrac{C}{1 + e^{-k(t - m)}} & \text{if growth = 'logistic'} \end{cases}$$

Equation 6: Mathematical expression for trend component (Taylor and Letham 40)


C is the carrying capacity, k is the initial growth rate, and m is an offset parameter (Taylor and 


Letham 40). 
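
The simplified trend in Equation 6 can be sketched as a small function. This is only an illustration of the formula without changepoints; the function name trend and the argument name cap (standing for the carrying capacity C) are assumptions, not part of the PROPHET library interface.

```python
import math


def trend(t, k, m, growth="linear", cap=None):
    """Simplified trend g(t) without changepoints."""
    if growth == "linear":
        return k * t + m
    if growth == "logistic":
        if cap is None:
            raise ValueError("logistic growth needs a carrying capacity")
        return cap / (1.0 + math.exp(-k * (t - m)))
    raise ValueError("growth must be 'linear' or 'logistic'")


print(trend(2, k=3, m=1))                              # 7
print(trend(5, k=1, m=5, growth="logistic", cap=100))  # 50.0
```

In the logistic case the value is exactly half of the carrying capacity when t equals the offset m, and it approaches the carrying capacity as t grows.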


The seasonality component models periodic changes, which can be weekly or yearly (Taylor
and Letham 41). The component uses a Fourier series to provide a flexible model of periodic
effects: with time measured in days, P = 365.25 for yearly seasonality and P = 7 for weekly
seasonality. The mathematical formula for this component can be simplified into:


$$s(t) = \sum_{n=1}^{N} \left( a_n \cos\left(\frac{2\pi n t}{P}\right) + b_n \sin\left(\frac{2\pi n t}{P}\right) \right)$$

Equation 7: Mathematical expression for seasonality component (Taylor and Letham 41)
P is the period the time series is expected to have, N is the order of the Fourier series, and
a_n and b_n are weights to be estimated (Taylor and Letham 41).
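
The Fourier series in Equation 7 can be evaluated directly, as in the sketch below. The weights here are made-up numbers for illustration; in PROPHET they are estimated from the training data. The function name seasonality is likewise an assumption.

```python
import math


def seasonality(t, period, a, b):
    """Evaluate s(t) for Fourier weights a = [a_1..a_N] and b = [b_1..b_N]."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length N")
    total = 0.0
    for n, (a_n, b_n) in enumerate(zip(a, b), start=1):
        angle = 2 * math.pi * n * t / period
        total += a_n * math.cos(angle) + b_n * math.sin(angle)
    return total


# Hypothetical weekly seasonality (P = 7) of order N = 2
a = [1.0, 0.5]
b = [2.0, 0.3]
print(seasonality(3, 7, a, b))
print(seasonality(10, 7, a, b))  # one period later, so the value repeats
```

A higher order N lets the curve follow faster changes within a period, at the risk of fitting noise.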


The holidays component accounts for the effects of holidays, as they can generate
irregularities in the data (Taylor and Letham 41). The component sets up a dummy variable for
each holiday, with the dates supplied by the user. The mathematical formula for this component
can be simplified into:


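$$h(t) = \sum_{i=1}^{H} \gamma_i \, \mathbf{1}\left(d(t) = d_i\right)$$

Equation 8: Mathematical expression for holidays component (Taylor and Letham 41)
H is the number of holidays, \gamma_i is the magnitude of the holiday effect, d(t) is the date
at time t, and d_i is the date of holiday i (Taylor and Letham 41). The indicator 1(·) equals 1
when the two dates match and 0 otherwise.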
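
The dummy-variable idea behind Equation 8 can be sketched as follows: the holiday effect on a given day is the sum of the magnitudes of all holidays that fall on that day. The dates and magnitudes below are hypothetical, and the function name holiday_effect is an assumption for illustration.

```python
from datetime import date


def holiday_effect(day, holidays):
    """Sum the magnitudes of every holiday whose date equals `day`.

    `holidays` is a list of (holiday_date, magnitude) pairs.
    """
    return sum(magnitude for d_i, magnitude in holidays if d_i == day)


# Hypothetical holidays and magnitudes, for illustration only
holidays = [(date(2022, 1, 9), 5000.0), (date(2022, 7, 3), 1200.0)]
print(holiday_effect(date(2022, 1, 9), holidays))   # 5000.0
print(holiday_effect(date(2022, 1, 10), holidays))  # 0
```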


The error term captures changes in the data that are unique and do not follow a trend (Taylor
and Letham 44). It is used to adjust the forecast based on the errors of the values that the
model has previously forecasted.
$$\xi(h) = E[\phi(T, h)]$$
Equation 9: Mathematical expression for error term (Taylor and Letham 44)

The h value represents the forecast horizon at which the error is made, and T is the last
point of historical data used to fit the model (Taylor and Letham 44).


2.3 Comparison 

When fitting the SARIMA model, a total of six order parameters (p, d, q, P, D, and Q) need to
be chosen, in addition to the seasonal period m, and evaluated based on the performance of the
forecast. Doing this by hand is time consuming. Creating a list of candidate parameters for the
SARIMA to run through and test removes the manual work, but it does not reduce the
computational expense of fitting every candidate. On the other hand, the PROPHET model does
not require these parameters to be chosen; it only needs adjustments to its holidays and
seasonality, and it determines the rest of the fit automatically from its training data.
Therefore, PROPHET is more straightforward to use than SARIMA.


Another difference between the two models is that SARIMA assumes a single seasonal pattern
with a fixed period, whereas PROPHET can generate forecasts based on multiple seasonalities.
Based on this, an assumption can be made that PROPHET will perform better for long-term
forecasts than SARIMA, but that SARIMA will perform better for short-term forecasts.


2.4 Relevant Studies 

The first study forecast seasonal influenza in Mainland China from 2005 to 2018 using the
SARIMA model. The results showed that the model fitted the seasonal fluctuation well, with
relative prediction errors ranging from 0.0010 to 0.0137 (Cong et al.). For example, the
relative error for July 2018 is 0.001, and the predicted value of 1.65 is very close to the
actual value of 1.64.


The second study investigated a minimalistic approach for evapotranspiration (ET) by using the 


PROPHET model. For comparison, the stochastic volatility (SVT) model was used against the 


PROPHET model. Results showed that the PROPHET model generally performed better in high 
rainfall scenarios while the SVR model was more suitable for low rainfall scenarios (Hosono et 
al.). This may have been due to the PROPHET model being more robust to outliers in the data 
which may have been more common in high rainfall scenarios. There may also have been 


missing values and data gaps which the PROPHET model is able to fill out. 


The third study investigated the forecasting of the air pollution in the city of Bhubaneswar 
located in India by comparing the SARIMA and PROPHET model. The approach to comparing 
the performance of these two models was by measuring their performance through root mean 
squared error (RMSE) and mean squared error (MSE). Results revealed that both models have 
provided a good quality of accuracy. However, the PROPHET model with a logarithmic data 


transformation did perform the best with the lowest RMSE and MSE value (Rani Samal et al.). 


III. Methodology

3.1 Data Collection 

The data required for this research was gathered from the World Health Organization (WHO),
which provides the number of daily and total cases and vaccinations per country in relation to
COVID-19. As this investigation concerns the accuracy of forecasting the number of daily
COVID-19 cases, the dataset titled ‘Daily cases and deaths by date reported to WHO’ was used.
The dataset is made up of 8 columns: Date_reported, Country_code, Country, WHO_region,
New_cases, Cumulative_cases, New_deaths, and Cumulative_deaths.


The data was split chronologically into training and testing sets, and the split ratio depended
on the trial that was run (80:20, 75:25, or 70:30). The testing data was required to test the
accuracy of the forecasts being made.
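
A chronological split of this kind can be sketched as below. The helper name chronological_split is hypothetical; the appendix scripts perform the same slicing directly on the data frame. The series is never shuffled, because the test set must come after the training set in time.

```python
def chronological_split(values, train_fraction):
    """Split a time-ordered list into (train, test) without shuffling."""
    if train_fraction <= 0 or train_fraction >= 1:
        raise ValueError("train_fraction must be between 0 and 1")
    train_size = int(train_fraction * len(values))
    return values[:train_size], values[train_size:]


series = list(range(20))  # stand-in for 20 daily case counts
train, test = chronological_split(series, 0.75)
print(len(train), len(test))  # 15 5
```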


3.2 Notes 

To build the SARIMA and PROPHET models, a fit-and-forecast approach was used. For SARIMA,
the pmdarima package, developed by Taylor G Smith and Aaron Smith with other external
contributors, helped automate the process of building the forecast. The PROPHET model already
has built-in fit and forecast methods, provided by the prophet package from Meta.


3.3.1 ARIMA Data Processing 
To use ARIMA, the data needs to be stationary, which means that its statistical properties,
such as its mean and variance, should not change over time. Many real datasets are not
stationary, because they contain trends or cycles. It is possible to make such data stationary
through different methods of transformation.


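This program uses two transformations: Box-Cox and log. The Box-Cox transformation reshapes a
non-normal distribution so that it is closer to a normal distribution. The log transformation
is the special case of Box-Cox with λ = 0; it takes the natural logarithm (base e, Euler’s
number) of the data, which makes the data less skewed and less spread out.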
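
The relationship between the two transformations can be sketched with the standard one-parameter Box-Cox formula, applied here to a single positive value. The function name box_cox is an assumption; the project itself relies on the transformers provided by pmdarima rather than on this helper.

```python
import math


def box_cox(y, lmbda):
    """One-parameter Box-Cox transform of a positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lmbda == 0:
        return math.log(y)  # the log transform is the lambda = 0 case
    return (y ** lmbda - 1) / lmbda


print(box_cox(4.0, 0.5))    # 2.0
print(box_cox(10.0, 0))     # natural log of 10
print(box_cox(10.0, 1e-6))  # very close to the natural log of 10
```

Because daily case counts can be zero, a small positive shift has to be added to the data before either transformation is applied.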


Many statistical models work best when the data is approximately bell-shaped, that is, normally
distributed with one peak and two symmetric sides. The normaltest function from SciPy was used
to check whether the data was bell-shaped after the transformation. It measures how far the data
departs from a normal distribution and returns a p-value. A small p-value (usually less than
0.05) means the data is unlikely to be normally distributed.


To find the best value of d, three stationarity tests were used: Kwiatkowski-Phillips-
Schmidt-Shin (KPSS), Augmented Dickey-Fuller (ADF), and Phillips-Perron (PP). Each checks
whether the data has a property that makes it non-stationary. The KPSS test takes stationarity
around a level or a linear trend as its null hypothesis (Shin and Schmidt). The ADF and PP
tests take the presence of a unit root, which makes the data drift over time, as their null
hypothesis, so rejecting it indicates stationarity (Cheung and Lai) (Breitung and Franses). If
the KPSS test indicates non-stationarity while the ADF or PP test indicates stationarity, the
data was treated as having a trend and was differenced. If the KPSS test indicates stationarity
while the ADF or PP test does not, the data was treated as having no such trend and was not
differenced. The best value of d is the smallest number of differences for which all three
tests agree that the data is stationary.


[Figure 1 image not reproduced; panel titles: ACF, Frequency]

Figure 1: Logarithmic transformation to data for normalizing (75:25) (Author own)


Figure one shows the frequency distribution of the data after the logarithmic transformation
was applied. Overall, the data achieves a single large peak, with some fluctuations.
Figure 2: BoxCox transformation to data for normalizing (75:25) (Author own)


Figure two shows the frequency distribution of the data after the BoxCox transformation was
applied. Overall, the data achieves only an unbalanced normal shape, which may lead to a
decrease in forecasting accuracy.

By comparing the results of both transformations, the logarithmic transformation should
theoretically give a better forecast than the BoxCox transformation.


3.3.2 SARIMA Model 

Pmdarima provides two kinds of transformers, for endogenous and exogenous variables. The
BoxCox and logarithm transformers are endogenous transformations, and the DateFeaturizer and
FourierFeaturizer are exogenous transformations. For this experiment, the endogenous
transformations were used because the dataset is univariate: it does not include any other
variables that may have affected the data and looks purely at the increase and decrease in the
number of COVID-19 cases. If the experiment were to use exogenous transformations, the model
would also take into account additional variables that are not present in the dataset, and
their effects.


To obtain the best results, both endogenous transformations were applied in the SARIMA model
and compared to find the superior result. The automatic parameter search from pmdarima
(AutoARIMA) was used, as it eliminates the need to change the parameter values manually. It
determined the most suitable parameters by calculating the AIC value for each candidate set.
Once the set of parameters with the lowest AIC value was found, the MAE, MdAPE, SMAPE, and
RMSE values of its forecast were calculated.


3.4.1 PROPHET Data Processing 

One of the main steps in preparing the data for the PROPHET model was to identify the dates
that had irregular effects on the time series. If the data has irregularities, the model might
not be able to capture the true patterns and make accurate predictions. The irregular effects
were identified from sudden increases or decreases in the data, as these would otherwise weigh
on the fitted PROPHET model (Taylor and Letham). These date ranges were then given to the
PROPHET model as a list of one-off holidays, so that the spikes were attributed to the holidays
component rather than to the trend and seasonality when fitting the data.


3.4.2 PROPHET Model 

The PROPHET model does not require mathematical transformations in order to fit and forecast
data. Instead, it automatically determines the seasonality and parameters it needs as the data
is fed into the model (Meta). All that was required was to define the model and fit it to the
data that had been processed beforehand. After forecasting, the MAE, MdAPE, SMAPE, and RMSE
were calculated to evaluate the accuracy of the forecast.
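
For reference, the four metrics can be sketched in plain Python using their conventional definitions. The function names and sample numbers are hypothetical. Note that the appendix scripts compute MdAPE differently, as the median absolute error divided by the median of the actual values, so this sketch would not necessarily reproduce the MdAPE figures reported in the results tables.

```python
import math
from statistics import median


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mdape(actual, predicted):
    """Median absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * median(abs((a - p) / a) for a, p in zip(actual, predicted))


def smape(actual, predicted):
    """Symmetric MAPE in percent, on a 0 to 200 scale."""
    terms = [200 * abs(p - a) / (abs(a) + abs(p)) for a, p in zip(actual, predicted)]
    return sum(terms) / len(terms)


actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(mae(actual, predicted), rmse(actual, predicted))
print(mdape(actual, predicted), smape(actual, predicted))
```

MAE and RMSE are expressed in the same units as the data (cases per day), whereas MdAPE and SMAPE are relative measures, which makes them easier to compare across periods with very different case counts.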


IV. Results 


4.1 SARIMA Results 


Split (training:forecast) | Metric | SARIMA (BoxCox)        | SARIMA (Log)
80:20                     | Model  | ARIMA(2,1,2)(2,0,2)[7] | ARIMA(2,1,3)(2,0,2)[7]
80:20                     | MAE    | 60279.269              | 54900.033
80:20                     | MdAPE  | 4.897                  | 4.451
80:20                     | SMAPE  | 123.37                 | 122.353
80:20                     | RMSE   | 69505.634              | 63122.743
75:25                     | Model  | ARIMA(2,1,2)(2,0,2)[7] | ARIMA(2,1,3)(2,0,2)[7]
75:25                     | MAE    | 41903.115              | 36198.114
75:25                     | MdAPE  | 2.589                  | 2.101
75:25                     | SMAPE  | 106.078                | 102.013
75:25                     | RMSE   | 50737.435              | 45327.182
70:30                     | Model  | ARIMA(2,1,2)(2,0,1)[7] | ARIMA(2,1,3)(2,0,2)[7]
70:30                     | MAE    | 887210.371             | 31716.685
70:30                     | MdAPE  | 27.245                 | 0.966
70:30                     | SMAPE  | 135.183                | 97.811
70:30                     | RMSE   | 1221044.958            | 47788.971


Table 1: 80:20, 75:25, 70:30, Results of SARIMA Model (Author own) 


Based on the metric evaluations, the BoxCox model performed best on every metric with 75%
training data and 25% testing data. For the logarithm model, the 75:25 split gave the lowest
RMSE, while the 70:30 split gave the lowest MAE, MdAPE, and SMAPE. When comparing the two
data transformations, it is evident that the logarithm transformation performed better in all
metrics for every split.
Figure 3: SARIMA(Log) Result (75:25) (Author own)
Figure 4: SARIMA(BoxCox) Result (75:25) (Author own)


Figures three and four display the forecasts. The blue line indicates the real data, and the
green line shows the forecast. Based on the graphs, forecasts that extend over a long horizon
should be handled with care.


4.2 PROPHET Results 


Split (training:forecast) | MAE      | MdAPE | SMAPE   | RMSE
80:20                     | 58686.53 | 4.568 | 124.548 | 66476.6
75:25                     | 127105.3 | 7.913 | 143.054 | 135686.54
70:30                     | 32370.36 | 1.002 | 102.414 | 50397.04


Table 2: 80:20, 75:25, 70:30, Results of PROPHET Model (Author own) 


Results show that the PROPHET model performed most optimally when given 70% training data 


and 30% testing data. 
Figure 5: PROPHET Result (75:25) (Author own)


Figure five shows the forecast of the model together with its upper and lower error margins.
The forecast increases steeply in order to account for the large spike in the data occurring
around March 2022.


4.3 Comparing Results. 


Split (training:forecast) | Metric | SARIMA (Log)           | PROPHET
80:20                     | Model  | ARIMA(2,1,3)(2,0,2)[7] | -
80:20                     | MAE    | 54900.033              | 58686.53
80:20                     | MdAPE  | 4.451                  | 4.568
80:20                     | SMAPE  | 122.353                | 124.548
80:20                     | RMSE   | 63122.743              | 66476.6
75:25                     | Model  | ARIMA(2,1,3)(2,0,2)[7] | -
75:25                     | MAE    | 36198.114              | 127105.3
75:25                     | MdAPE  | 2.101                  | 7.913
75:25                     | SMAPE  | 102.013                | 143.054
75:25                     | RMSE   | 45327.182              | 135686.54
70:30                     | Model  | ARIMA(2,1,3)(2,0,2)[7] | -
70:30                     | MAE    | 31716.685              | 32370.36
70:30                     | MdAPE  | 0.966                  | 1.002
70:30                     | SMAPE  | 97.811                 | 102.414
70:30                     | RMSE   | 47788.971              | 50397.04


Table 3: Compared result between SARIMA and PROPHET (Author own) 


Results show that the SARIMA(Log) has outperformed the PROPHET model in all data ranges. 


V. Discussion 

There were several limitations to this study. First, the data did not show a consistent
seasonality throughout the years in South Korea. While there were ‘waves’ of COVID-19 cases,
they occurred at seemingly unpredictable times. To test seasonality further, the daily number
of cases in the United States and in South Korea was compared. The two series did not show any
correlation in trend; each country experienced its own distinct waves of COVID-19. Secondly, it
may be difficult to generalize the results from South Korea to the rest of the world, as
countries around the world experienced different effects of COVID-19. However, as both models
performed well, they could be used to help predict the number of cases within South Korea.


Prior to building the SARIMA model, the ARIMA model was used and tested but returned 
inaccurate forecasts which made the experiment not fair for comparison. Therefore, further 


research was conducted to use the SARIMA model and help improve the forecast. 


VI. Conclusion 

A further extension of the SARIMA model is the SARIMAX model. To summarize, the SARIMAX
model has an additional component, ‘X’, which accounts for exogenous variables. This component
allows the model to account for external variables that may influence the data, which in turn
can help the model make more accurate forecasts than it could with a single variable. While
the ARIMA and SARIMA models are both univariate models, the SARIMAX model is a multivariate
model (Arunraj et al.).


Therefore, a continuation of this research could use the SARIMAX model together with another
forecasting model such as the long short-term memory (LSTM) model or the light gradient
boosting machine (LightGBM) model. These models are more complex than the models used in this
research and could hypothetically return more accurate forecasts.


Overall, the use of the SARIMA and PROPHET models in this experiment was a success despite the
limitations, to the point where real-life application is possible. However, as both models
were fitted on only about three years of data, practical use would require routine updates to
the dataset and modifications of the parameters.


Further thought on the application of the two models has raised the idea of testing them on
other applications such as stocks or climate change. Stock and climate datasets are larger, as
they have been recorded for over a decade. The larger amount of data would most likely help
improve the performance of the models and perhaps give different results from those of this
experiment.
VIII. Appendix

SARIMA (BoxCox): box-sarima.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pmdarima as pm
from pmdarima.metrics import smape
from pmdarima.pipeline import Pipeline
from pmdarima.preprocessing import BoxCoxEndogTransformer
from pmdarima.utils import tsdisplay
from scipy.stats import normaltest
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_absolute_percentage_error as mape
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import median_absolute_error as mdae

print(f'Using pmdarima {pm.__version__}')

df = pd.read_csv('south-korea-gathered-data.csv')
print(df.head())

# Chronological split: first 75% for training, the remainder for testing
dataSize = len(df)
train_size = int(0.75 * dataSize)
y_train = df['New_cases'][:train_size]
y_test = df['New_cases'][train_size:]

# Inspect the Box-Cox transformed training data and test it for normality
y_train_bc, _ = BoxCoxEndogTransformer(lmbda2=1e-6).fit_transform(y_train)
tsdisplay(y_train_bc, lag_max=100)
print(normaltest(y_train_bc)[1])  # p-value of the normality test

# Box-Cox transform followed by an automatic seasonal ARIMA search (m=7: weekly)
fit2 = Pipeline([
    ('boxcox', BoxCoxEndogTransformer(lmbda2=1e-6)),
    ('arima', pm.AutoARIMA(trace=True,
                           suppress_warnings=True,
                           m=7,
                           seasonal=True,
                           seasonal_test='ocsb')),
])
fit2.fit(y_train)
print(fit2.summary())


def plot_forecasts(forecasts, title, figsize=(8, 12)):
    # Top panel: training data (blue) and forecast (green)
    x = np.arange(y_train.shape[0] + forecasts.shape[0])
    fig, axes = plt.subplots(2, 1, sharex=False, figsize=figsize)
    axes[0].plot(x[:y_train.shape[0]], y_train, c='b')
    axes[0].plot(x[y_train.shape[0]:], forecasts, c='g')
    axes[0].set_xlabel(f'New Cases (RMSE={np.sqrt(mse(y_test, forecasts)):.3f})')
    axes[0].set_title(title)

    # Bottom panel: histogram of the forecast residuals
    resid = y_test - forecasts
    _, p = normaltest(resid)
    axes[1].hist(resid, bins=15)
    axes[1].axvline(0, linestyle='--', c='r')
    axes[1].set_title(f'Residuals (p={p:.3f})')

    plt.tight_layout()
    plt.show()


# Forecast as many steps as there are test observations
forecasts = fit2.predict(y_test.shape[0])
plot_forecasts(forecasts, title='Box-Cox transformed ARIMA')

mae_value = mae(y_test, forecasts)
# MdAPE is computed here as the median absolute error scaled by the median actual value
mdape_value = mdae(y_test, forecasts) / np.median(y_test)
smape_value = smape(y_test, forecasts)
mape_value = mape(y_test, forecasts)
rmse_value = np.sqrt(mse(y_test, forecasts))

print(f'MAE: {mae_value:.3f}')
print(f'MdAPE: {mdape_value:.3f}')
print(f'SMAPE: {smape_value:.3f}')
print(f'MAPE: {mape_value:.3f}')
print(f'RMSE: {rmse_value:.3f}')


SARIMA (Log): log-sarima.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pmdarima as pm
from pmdarima.metrics import smape
from pmdarima.pipeline import Pipeline
from pmdarima.preprocessing import LogEndogTransformer
from pmdarima.utils import tsdisplay
from scipy.stats import normaltest
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_absolute_percentage_error as mape
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import median_absolute_error as mdae

print(f'Using pmdarima {pm.__version__}')

df = pd.read_csv('south-korea-gathered-data.csv')
print(df.head())

# Chronological split: first 70% for training, the remainder for testing
dataSize = len(df)
train_size = int(0.7 * dataSize)
y_train = df['New_cases'][:train_size]
y_test = df['New_cases'][train_size:]

# Inspect the log-transformed training data and test it for normality
y_train_log, _ = LogEndogTransformer(lmbda=1e-6).fit_transform(y_train)
tsdisplay(y_train_log, lag_max=100)
print(normaltest(y_train_log)[1])  # p-value of the normality test

# Log transform followed by an automatic seasonal ARIMA search (m=7: weekly)
fit3 = Pipeline([
    ('log', LogEndogTransformer(lmbda=1e-6)),
    ('arima', pm.AutoARIMA(trace=True,
                           suppress_warnings=True,
                           m=7,
                           seasonal=True,
                           seasonal_test='ocsb')),
])
fit3.fit(y_train)
print(fit3.summary())


def plot_forecasts(forecasts, title, figsize=(8, 12)):
    # Top panel: training data (blue) and forecast (green)
    x = np.arange(y_train.shape[0] + forecasts.shape[0])
    fig, axes = plt.subplots(2, 1, sharex=False, figsize=figsize)
    axes[0].plot(x[:y_train.shape[0]], y_train, c='b')
    axes[0].plot(x[y_train.shape[0]:], forecasts, c='g')
    axes[0].set_xlabel(f'New Cases (RMSE={np.sqrt(mse(y_test, forecasts)):.3f})')
    axes[0].set_title(title)

    # Bottom panel: histogram of the forecast residuals
    resid = y_test - forecasts
    _, p = normaltest(resid)
    axes[1].hist(resid, bins=15)
    axes[1].axvline(0, linestyle='--', c='r')
    axes[1].set_title(f'Residuals (p={p:.3f})')

    plt.tight_layout()
    plt.show()


# Forecast as many steps as there are test observations
forecasts_log = fit3.predict(y_test.shape[0])
plot_forecasts(forecasts_log, title='Log transformed ARIMA')

mae_value_log = mae(y_test, forecasts_log)
# MdAPE is computed here as the median absolute error scaled by the median actual value
mdape_value_log = mdae(y_test, forecasts_log) / np.median(y_test)
smape_value_log = smape(y_test, forecasts_log)
mape_value_log = mape(y_test, forecasts_log)
rmse_value_log = np.sqrt(mse(y_test, forecasts_log))

print(f'MAE: {mae_value_log:.3f}')
print(f'MdAPE: {mdape_value_log:.3f}')
print(f'SMAPE: {smape_value_log:.3f}')
print(f'MAPE: {mape_value_log:.3f}')
print(f'RMSE: {rmse_value_log:.3f}')


PROPHET: prophetandseason.ipynb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pmdarima.metrics import smape
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.metrics import median_absolute_error as mdae

# Date ranges of the large spikes, passed to Prophet as one-off holidays
spikes = pd.DataFrame([
    {'holiday': 'spike_1', 'ds': '2022-01-09', 'lower_window': 0, 'ds_upper': '2022-06-26'},
    {'holiday': 'spike_2', 'ds': '2022-07-03', 'lower_window': 0, 'ds_upper': '2022-09-02'},
])
for t_col in ['ds', 'ds_upper']:
    spikes[t_col] = pd.to_datetime(spikes[t_col])
# upper_window is the number of days each spike lasts after its start date
spikes['upper_window'] = (spikes['ds_upper'] - spikes['ds']).dt.days
spikes

df = pd.read_csv('south-korea-gathered-data-prophet.csv')
df.head()

# Chronological split: first 70% for training, last 30% for testing
total_rows = df.shape[0]
train_rows = int(total_rows * 0.7)
test_rows = int(total_rows * 0.3)
train = df.iloc[:train_rows]
test = df.iloc[-test_rows:]
y_test = test['y'].values

m2 = Prophet(holidays=spikes)
m2.fit(train)
# Forecast one step per test row (the original used periods=365, which equals
# test_rows only for the 70:30 split of this dataset)
future2 = m2.make_future_dataframe(periods=test_rows)
forecast2 = m2.predict(future2)

m2.plot(forecast2)
plt.axhline(y=0, color='red')
plt.title('Spikes as one-off holidays')
plt.show()
m2.plot_components(forecast2)

# Keep only the forecasted values that correspond to the test period
y_pred = forecast2['yhat'].values[-test_rows:]

rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(f'The RMSE value is {rmse:.2f}')

mae = mean_absolute_error(y_test, y_pred)
print(f'The MAE value is {mae:.2f}')

mape = mean_absolute_percentage_error(y_test, y_pred)
print(f'The MAPE value is {mape:.2f}')

# MdAPE is computed here as the median absolute error scaled by the median actual value
mdape_value = mdae(y_test, y_pred) / np.median(y_test)
print(f'The MdAPE value is {mdape_value:.3f}')

smape_value = smape(y_test, y_pred)


Data 1: south-korea-gathered-data.csv
date     | Country_code | Country           | WHO_region | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths
1/3/2020 | KR           | Republic of Korea | WPRO       | 0         | 0                | 0          | 0
1/4/2020 | KR           | Republic of Korea | WPRO       | 0         | 0                | 0          | 0
1/5/2020 | KR           | Republic of Korea | WPRO       | 0         | 0                | 0          | 0
1/6/2020 | KR           | Republic of Korea | WPRO       | 0         | 0                | 0          | 0
1/7/2020 | KR           | Republic of Korea | WPRO       | 0         | 0                | 0          | 0


Data 2: south-korea-gathered-data-prophet.csv
ds       | Country_code | Country           | WHO_region | y | Cumulative_cases | New_deaths | Cumulative_deaths
1/3/2020 | KR           | Republic of Korea | WPRO       | 0 | 0                | 0          | 0
1/4/2020 | KR           | Republic of Korea | WPRO       | 0 | 0                | 0          | 0
1/5/2020 | KR           | Republic of Korea | WPRO       | 0 | 0                | 0          | 0
1/6/2020 | KR           | Republic of Korea | WPRO       | 0 | 0                | 0          | 0
1/7/2020 | KR           | Republic of Korea | WPRO       | 0 | 0                | 0          | 0
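Data 3: usa-gathered-data.csv
Date_reported | Country_code | Country                  | WHO_region | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths
1/3/2020      | US           | United States of America | AMRO       | 0         |                  |            |
1/4/2020      | US           | United States of America | AMRO       |           |                  |            |
1/5/2020      | US           | United States of America | AMRO       |           |                  |            |
1/6/2020      | US           | United States of America | AMRO       |           |                  |            |
1/7/2020      | US           | United States of America | AMRO       |           |                  |            |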


